Import Libraries

In [14]:
import sys
import pandas as pd
import os
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import tensorflow as tf
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.applications import DenseNet121
from tensorflow.keras.preprocessing.image import ImageDataGenerator

1. Exploratory Data Analysis (EDA)¶

📊 Weight Dataset Processing and Cleaning¶

Overview¶

This notebook loads and preprocesses weight dataset metadata from an Excel file.
It includes data cleaning, error correction, and image path extraction.

Steps:¶

1️⃣ Load the dataset from an Excel file 📂
2️⃣ Convert weight values from string (with commas) to numeric format 🔢
3️⃣ Fix known errors (e.g., incorrect weight entries) 🛠️
4️⃣ Extract image file paths for further processing 🖼️

In [ ]:
print(sys.executable)

def load_metadata(excel_path):
    """
    Loads the metadata from the given Excel file
    and returns a cleaned pandas DataFrame.
    """
    df = pd.read_excel(excel_path)

    # Fix data-entry error surfaced by the EDA below: decimal commas (e.g. '4,75') -> decimal points
    df["weight"] = df["weight"].astype(str).apply(lambda x: x.replace(",", "."))

    # Convert 'weight' column to numeric (coerce errors to NaN)
    df['weight'] = pd.to_numeric(df['weight'], errors='coerce')

    # # Fix one entry from 815 lbs to 8.15 lbs
    # df.loc[df["weight"] == 815, "weight"] = 8.15

    # Extract local file path from 'Row Data'
    def get_local_path(row_data):
        file_name = row_data.split('/')[-1]
        return os.path.join('/opt/weight_dataset_v1', file_name)

    if "Row Data" in df.columns:
        df['img_path'] = df['Row Data'].apply(get_local_path)
    else:
        print("Warning: 'Row Data' column not found in the dataset.")

    return df
/bin/python3

📊 Weight Dataset: Data Exploration & Cleaning¶

📌 Overview¶

This notebook processes and explores a weight dataset from an Excel file.
The goal is to clean, analyze, and visualize the data for further use in machine learning models.

🔍 Steps¶

1️⃣ Load Data from an Excel file 📂
2️⃣ Check for Missing Values and visualize missing data 🔎
3️⃣ Identify & Handle Outliers in the weight column 🚨
4️⃣ Explore Data Distributions using plots 📊
5️⃣ Detect Duplicates and unique values 📝

In [16]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
!pip install missingno
import missingno as msno  # optional missing-data plots (e.g. msno.matrix); the heatmap below uses seaborn

# Load dataset
excel_path = "/opt/weight_dataset_v1/excel_dataset/data-batch-01.xlsx"
df = load_metadata(excel_path)

## 🔍 Basic Data Exploration
print("Dataset Overview:")
print(df.info())

print("\nFirst 5 Rows:")
print(df.head())

## 📊 Missing Values Analysis
print("\nMissing values per column:")
print(df.isnull().sum())

# Missing values visualization
plt.figure(figsize=(10, 5))
sns.heatmap(df.isnull(), cmap="viridis", cbar=False, yticklabels=False)
plt.title("Missing Values Heatmap")
plt.show()

## 🔢 Statistical Summary
print("\nStatistical Summary:")
print(df.describe())

# Unique values per categorical column
categorical_cols = df.select_dtypes(include=["object"]).columns
print("\nUnique values per categorical column:")
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")

## 🚩 Outlier Detection (Weight)
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["weight"])
plt.title("Outlier Detection - Weight Column")
plt.show()

# Count rows where weight is above the 99th percentile
upper_bound = df["weight"].quantile(0.99)
print(f"\nRows with extremely high weight (>99th percentile {upper_bound:.2f}):")
print(df[df["weight"] > upper_bound])

## 📝 Check for Duplicates
print("\nDuplicate Rows:", df.duplicated().sum())

## 🔎 Value Distribution
plt.figure(figsize=(10, 5))
sns.histplot(df["weight"], bins=30, kde=True)
plt.title("Weight Distribution")
plt.xlabel("Weight")
plt.ylabel("Count")
plt.show()
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: missingno in ./.local/lib/python3.10/site-packages (0.5.2)
Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   675 non-null    object 
 1   Global Key           675 non-null    object 
 2   Row Data             675 non-null    object 
 3   Dataset ID           675 non-null    object 
 4   Dataset Name         675 non-null    object 
 5   Created At           675 non-null    object 
 6   Updated At           675 non-null    object 
 7   Created By           675 non-null    object 
 8   Height               675 non-null    int64  
 9   Width                675 non-null    int64  
 10  Asset Type           675 non-null    object 
 11  MIME Type            675 non-null    object 
 12  EXIF Rotation        675 non-null    int64  
 13  Experiment ID        675 non-null    object 
 14  Experiment Name      675 non-null    object 
 15  Run Name             675 non-null    object 
 16  Run Data Row ID      675 non-null    object 
 17  Split                675 non-null    object 
 18  Label Kind           675 non-null    object 
 19  Version              675 non-null    object 
 20  Label ID             675 non-null    object 
 21  Feature ID           675 non-null    object 
 22  Feature Schema ID    675 non-null    object 
 23  Name                 675 non-null    object 
 24  Value                675 non-null    object 
 25  Annotation Kind      675 non-null    object 
 26  Bounding Box Top     675 non-null    int64  
 27  Bounding Box Left    675 non-null    int64  
 28  Bounding Box Height  675 non-null    int64  
 29  Bounding Box Width   675 non-null    int64  
 30  species              675 non-null    object 
 31  gender               123 non-null    object 
 32  color                675 non-null    object 
 33  weight               675 non-null    float64
 34  img_path             675 non-null    object 
dtypes: float64(1), int64(7), object(27)
memory usage: 184.7+ KB
None

First 5 Rows:
                          ID  \
0  clyxetrm60mcb0796rsdu4ob9   
1  clyxetrm60mcc0796uilaudlq   
2  clyxetrm60mcd0796albl43as   
3  clyxetrm60mce0796r9gf6geg   
4  clyxetrm60mcf0796zbzj1nb6   

                                          Global Key  \
0  upload-raw-images/circleseafoods-camera-03/202...   
1  upload-raw-images/circleseafoods-camera-03/202...   
2  upload-raw-images/circleseafoods-camera-03/202...   
3  upload-raw-images/circleseafoods-camera-03/202...   
4  upload-raw-images/circleseafoods-camera-03/202...   

                                            Row Data  \
0  gs://upload-raw-images/circleseafoods-camera-0...   
1  gs://upload-raw-images/circleseafoods-camera-0...   
2  gs://upload-raw-images/circleseafoods-camera-0...   
3  gs://upload-raw-images/circleseafoods-camera-0...   
4  gs://upload-raw-images/circleseafoods-camera-0...   

                  Dataset ID            Dataset Name  \
0  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   
1  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   
2  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   
3  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   
4  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   

                      Created At                     Updated At  \
0  2024-07-22T19:58:52.688+00:00  2024-07-22T19:58:59.539+00:00   
1  2024-07-22T19:58:52.688+00:00  2024-07-22T19:59:07.704+00:00   
2  2024-07-22T19:58:52.688+00:00  2024-07-22T19:59:08.212+00:00   
3  2024-07-22T19:58:52.688+00:00  2024-07-22T19:59:07.983+00:00   
4  2024-07-22T19:58:52.688+00:00  2024-07-22T19:59:07.228+00:00   

         Created By  Height  Width  ...   Annotation Kind Bounding Box Top  \
0  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   
1  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   
2  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   
3  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   
4  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   

   Bounding Box Left Bounding Box Height Bounding Box Width species  gender  \
0                492                 699                193    chum  female   
1                449                 720                235    chum    male   
2                448                 720                235    chum    male   
3                449                 720                237    chum    male   
4                449                 720                238    chum    male   

    color weight                                           img_path  
0  bright   4.75  /opt/weight_dataset_v1/2024_07_18_17_36_30_792...  
1    dark  13.95  /opt/weight_dataset_v1/2024_07_18_17_37_23_113...  
2    dark  13.95  /opt/weight_dataset_v1/2024_07_18_17_37_23_113...  
3    dark  13.95  /opt/weight_dataset_v1/2024_07_18_17_37_23_113...  
4    dark  13.95  /opt/weight_dataset_v1/2024_07_18_17_37_25_262...  

[5 rows x 35 columns]

Missing values per column:
ID                       0
Global Key               0
Row Data                 0
Dataset ID               0
Dataset Name             0
Created At               0
Updated At               0
Created By               0
Height                   0
Width                    0
Asset Type               0
MIME Type                0
EXIF Rotation            0
Experiment ID            0
Experiment Name          0
Run Name                 0
Run Data Row ID          0
Split                    0
Label Kind               0
Version                  0
Label ID                 0
Feature ID               0
Feature Schema ID        0
Name                     0
Value                    0
Annotation Kind          0
Bounding Box Top         0
Bounding Box Left        0
Bounding Box Height      0
Bounding Box Width       0
species                  0
gender                 552
color                    0
weight                   0
img_path                 0
dtype: int64
[Figure: Missing Values Heatmap]
Statistical Summary:
       Height   Width  EXIF Rotation  Bounding Box Top  Bounding Box Left  \
count   675.0   675.0          675.0        675.000000         675.000000   
mean    720.0  1280.0            1.0         52.625185         488.053333   
std       0.0     0.0            0.0         60.888040          30.595816   
min     720.0  1280.0            1.0          0.000000         375.000000   
25%     720.0  1280.0            1.0          0.000000         474.000000   
50%     720.0  1280.0            1.0         19.000000         491.000000   
75%     720.0  1280.0            1.0        111.000000         508.000000   
max     720.0  1280.0            1.0        189.000000         541.000000   

       Bounding Box Height  Bounding Box Width      weight  
count           675.000000          675.000000  675.000000  
mean            640.204444          178.694815    5.213407  
std              77.320269           30.051073    2.623659  
min             440.000000          119.000000    1.800000  
25%             569.000000          155.000000    3.250000  
50%             667.000000          179.000000    4.750000  
75%             706.000000          199.000000    6.350000  
max             720.000000          269.000000   14.500000  

Unique values per categorical column:
ID: 675 unique values
Global Key: 675 unique values
Row Data: 675 unique values
Dataset ID: 1 unique values
Dataset Name: 1 unique values
Created At: 9 unique values
Updated At: 657 unique values
Created By: 1 unique values
Asset Type: 1 unique values
MIME Type: 1 unique values
Experiment ID: 1 unique values
Experiment Name: 1 unique values
Run Name: 1 unique values
Run Data Row ID: 675 unique values
Split: 3 unique values
Label Kind: 1 unique values
Version: 1 unique values
Label ID: 675 unique values
Feature ID: 675 unique values
Feature Schema ID: 1 unique values
Name: 1 unique values
Value: 1 unique values
Annotation Kind: 1 unique values
species: 2 unique values
gender: 2 unique values
color: 3 unique values
img_path: 675 unique values
[Figure: Outlier Detection - Weight Column (boxplot)]
Rows with extremely high weight (>99th percentile 14.50):
Empty DataFrame
Columns: [ID, Global Key, Row Data, Dataset ID, Dataset Name, Created At, Updated At, Created By, Height, Width, Asset Type, MIME Type, EXIF Rotation, Experiment ID, Experiment Name, Run Name, Run Data Row ID, Split, Label Kind, Version, Label ID, Feature ID, Feature Schema ID, Name, Value, Annotation Kind, Bounding Box Top, Bounding Box Left, Bounding Box Height, Bounding Box Width, species, gender, color, weight, img_path]
Index: []

[0 rows x 35 columns]

Duplicate Rows: 0
[Figure: Weight Distribution (histogram with KDE)]

📊 Exploratory Data Analysis (EDA) Report¶

🔍 Dataset Overview¶

  • Total Records: 675
  • Total Columns: 35
  • Missing Values:
    • gender: 552 missing values (major issue)
  • Duplicate Rows: 0

📈 Missing Values Analysis¶

A heatmap was generated to visualize missing data. The gender column has a significant number of missing values.

📊 Statistical Summary¶

Key Numerical Features¶

Feature               Mean    Std Dev   Min    25%    50%    75%    Max
Bounding Box Height   640.2   77.3      440    569    667    706    720
Bounding Box Width    178.7   30.1      119    155    179    199    269
Weight                5.21    2.62      1.8    3.25   4.75   6.35   14.5

🚀 Unique Values in Categorical Columns¶

  • species: 2 unique values
  • gender: 2 unique values (but heavily missing)
  • color: 3 unique values

🚨 Outlier Detection¶

  • Weight Distribution: No extreme outliers were found above the 99th percentile (14.5).
  • Boxplot generated to visualize weight distribution.
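Beyond the fixed 99th-percentile cutoff used in the cell above, an interquartile-range (IQR) fence is a common complementary outlier check. A minimal sketch; on the real data the call would be `iqr_outliers(df["weight"])`, while the Series below is just a stand-in:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of s outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s[(s < q1 - k * iqr) | (s > q3 + k * iqr)]

# Tiny stand-in series with one obvious outlier
demo = pd.Series([3.0, 4.0, 5.0, 6.0, 100.0])
print(iqr_outliers(demo).tolist())  # [100.0]
```

Note that on the reported quartiles (Q1 = 3.25, Q3 = 6.35) the upper fence would sit at roughly 6.35 + 1.5 × 3.10 = 11.0 kg, so the heaviest fish (13.95–14.5 kg) would be flagged by this rule even though the percentile check found none.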

📊 Data Distribution¶

  • Weight Histogram: A KDE histogram was plotted to observe distribution.
  • Categorical Columns: Counts of unique values were recorded.

📌 Conclusions¶

  1. Missing Data Concern: gender column has a high number of missing values (552 out of 675).
  2. No Extreme Outliers: No weight values exceeded the 99th percentile threshold.
  3. Data Consistency: No duplicate rows found.
  4. Weight Distribution: Appears normal, mostly between 3.25 and 6.35.
  5. Categorical Insights: Limited unique values in key categorical variables.

🛠 Suggested Next Steps¶

  • Consider imputing or removing the gender column due to excessive missing data.
  • Further analysis on species and color for classification.
  • More in-depth correlation analysis between weight and bounding box dimensions.
In [17]:
import seaborn as sns
import matplotlib.pyplot as plt

# Examining the relationship between species and gender
sns.countplot(x="species", hue="gender", data=df)
plt.title("Species and Gender Distribution")
plt.show()

# Examining the relationship between color and gender
sns.countplot(x="color", hue="gender", data=df)
plt.title("Color and Gender Distribution")
plt.show()

# Species and gender distribution table
species_gender_counts = df.groupby(["species", "gender"]).size().unstack()
print("### Species vs Gender Distribution ###")
print(species_gender_counts)

# Color and gender distribution table
color_gender_counts = df.groupby(["color", "gender"]).size().unstack()
print("\n### Color vs Gender Distribution ###")
print(color_gender_counts)
[Figure: Species and Gender Distribution]
[Figure: Color and Gender Distribution]
### Species vs Gender Distribution ###
gender   female  male
species              
chum         99    24

### Color vs Gender Distribution ###
gender       female  male
color                    
bright          7.0   NaN
dark           63.0  24.0
semi_bright    29.0   NaN

📊 Exploratory Data Analysis (EDA) Results¶

1️⃣ Species vs. Gender Distribution¶

Raw Data:¶

Species   Female   Male
Chum      99       24

Analysis & Interpretation:¶

  • Chum species has a strong gender imbalance, with significantly more females (99) than males (24).
  • If gender is missing for a Chum individual, it is highly likely to be Female.
  • This imbalance suggests that species can be used as a predictive feature for missing gender values.


2️⃣ Color vs. Gender Distribution¶

Raw Data:¶

Color         Female   Male
Bright        7        0 (NaN)
Dark          63       24
Semi-Bright   29       0 (NaN)

Analysis & Interpretation:¶

  • Bright and Semi-Bright colors only have female individuals, meaning if an individual has these colors, it is highly likely to be Female.
  • Dark color has both male (24) and female (63) individuals, but females dominate.
  • Color can be a useful predictor for gender, especially for Bright and Semi-Bright individuals.

In [18]:
# Calculate the average weight by gender
gender_weight_mean = df.groupby("gender")["weight"].mean()
print("### Gender vs Weight ###")
print(gender_weight_mean)
### Gender vs Weight ###
gender
female     7.048485
male      14.225000
Name: weight, dtype: float64

3️⃣ Weight vs. Gender Distribution¶

Raw Data:¶

Gender   Average Weight (kg)
Female   7.05
Male     14.23

Analysis & Interpretation:¶

  • Males are roughly twice as heavy as females (14.23 kg vs. 7.05 kg).
  • This strong difference means weight can be used to predict missing gender values:
    • If weight > 10 kg → Highly likely to be Male
    • If weight ≤ 10 kg → Highly likely to be Female
In [19]:
import seaborn as sns

# Compute the correlation matrix
correlation_matrix = df[['weight', 'Bounding Box Height', 'Bounding Box Width']].corr()

# Plot the correlation heatmap
plt.figure(figsize=(8,6))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm", fmt=".2f")
plt.title("Correlation Between Weight and Bounding Box Dimensions")
plt.show()

# Print correlation values
print("### Correlation Matrix ###")
print(correlation_matrix)
[Figure: Correlation Between Weight and Bounding Box Dimensions (heatmap)]
### Correlation Matrix ###
                       weight  Bounding Box Height  Bounding Box Width
weight               1.000000             0.756049            0.783899
Bounding Box Height  0.756049             1.000000            0.717996
Bounding Box Width   0.783899             0.717996            1.000000

4️⃣ Correlation Analysis (Weight & Bounding Box Dimensions)¶

Raw Data (Correlation Matrix):¶

Feature               Weight   Bounding Box Height   Bounding Box Width
Weight                1.00     0.76                  0.78
Bounding Box Height   0.76     1.00                  0.72
Bounding Box Width    0.78     0.72                  1.00

Analysis & Interpretation:¶

  • Weight has a strong correlation with both Bounding Box Height (0.76) and Bounding Box Width (0.78).
  • This means Bounding Box dimensions can be used to estimate missing weight values.
  • Bounding Box Width and Height are also correlated (0.72), meaning they are somewhat redundant.
  • If an individual’s weight is missing, it can be estimated using its bounding box measurements.
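The estimation idea in the last bullet can be sketched as an ordinary least-squares fit. This is an illustration on synthetic data shaped like the summary statistics above, not the notebook's actual regressor; on the real data the features would be `df[['Bounding Box Height', 'Bounding Box Width']]` and the target `df['weight']`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in: weight grows with bounding-box height and width, plus noise,
# with dimensions drawn from the observed min/max ranges
heights = rng.uniform(440, 720, size=200)
widths = rng.uniform(119, 269, size=200)
weights = 0.01 * heights + 0.02 * widths + rng.normal(0, 0.3, size=200)

X = np.column_stack([heights, widths])
model = LinearRegression().fit(X, weights)

# Estimate the weight of a fish whose weight label is missing,
# given its bounding-box height and width
est = model.predict([[667.0, 179.0]])
print(f"estimated weight: {est[0]:.2f}")
```

Because height and width are themselves correlated (0.72), a regularized model (e.g. `Ridge`) may behave better than plain OLS on the real data.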

📌 General Conclusions¶

✅ Species (Chum) is strongly biased towards females, meaning missing gender values for this species are most likely Female.
✅ Bright and Semi-Bright colors are exclusively Female, making color a strong predictor for gender.
✅ Weight is a highly effective predictor of gender, as males are significantly heavier than females.
✅ Bounding Box Height & Width are strongly correlated with weight and can be used for missing weight estimation.
✅ For missing gender values, a rule-based or machine learning model can be built using species, color, and weight.


🛠 Recommended Next Steps¶

1️⃣ Predict missing gender values using rules:

  • If color is Bright or Semi-Bright → Assign "Female".
  • If species is Chum → Assign "Female".

2️⃣ Use a regression model to estimate missing weight based on bounding box dimensions.

3️⃣ Train a machine learning model (Random Forest or Logistic Regression) to predict missing gender values.

2. Data Preprocessing¶

📌 Updated Strategy for Filling Missing Gender Values¶

Instead of using species, we will rely only on weight and color, as they have a more direct relationship with gender.

📊 Rule-Based Gender Imputation¶

1️⃣ If color is "bright" or "semi_bright" → Assign "Female"

  • (Since these colors had only female individuals)

2️⃣ If weight > 10 kg → Assign "Male"

  • (Since males are significantly heavier on average)

3️⃣ Otherwise → Assign "Female"

  • (Because females dominate in the dataset)
In [20]:
import pandas as pd
 
# Count missing values before update
missing_before = df["gender"].isnull().sum()
 
 
# Function to predict missing gender
def predict_gender(row):
    if pd.isnull(row["gender"]):  # Only apply if gender is missing
        # Rule 1: bright/semi_bright fish were exclusively female among labeled rows
        if row["color"] in ["bright", "semi_bright"]:
            return "female"
        # Rule 2: If weight is greater than 10 kg, it's most likely Male
        elif row["weight"] > 10:
            return "male"
        # Default Rule: Assign Female (since females are the majority)
        else:
            return "female"
    else:
        return row["gender"]

# Apply the function to fill missing gender values
df["gender"] = df.apply(predict_gender, axis=1)

# Display how many missing values remain
print("Missing values after filling:", df["gender"].isnull().sum())

# Count missing values after update
missing_after = df["gender"].isnull().sum()

# Calculate how many missing values were filled
updated_count = missing_before - missing_after
df_original = df.copy()  # snapshot before later transformations (bare assignment would only alias df)
# Print the summary
print(f"🚀 Gender Update Summary 🚀")
print(f"🔹 Missing values before update: {missing_before}")
print(f"🔹 Missing values after update: {missing_after}")
print(f"✅ Total updated gender values: {updated_count}")
Missing values after filling: 0
🚀 Gender Update Summary 🚀
🔹 Missing values before update: 552
🔹 Missing values after update: 0
✅ Total updated gender values: 552

🚀 Gender Data Update Report¶

🔍 Summary of Changes¶

Metric                                 Value
Missing gender values before update    552
Missing gender values after update     0
Total updated gender values            552

✅ All 552 missing gender values were successfully filled.
✅ The dataset now has no missing gender values, ensuring consistency for further analysis.


📌 Methodology Used for Gender Imputation¶

To ensure accuracy and maintain data integrity, a rule-based approach was applied:

1️⃣ If color was "bright" or "semi_bright" → Assigned "Female"

  • These colors had only female individuals among labeled rows, so the rule matches every observed case (its accuracy on the imputed rows is assumed, not verified).

2️⃣ If weight > 10 kg → Assigned "Male"

  • Male individuals were significantly heavier than females, making weight a strong predictor.

3️⃣ Otherwise → Assigned "Female"

  • Females were the dominant gender in the dataset.

⚠️ Species data was intentionally excluded to avoid introducing bias, as gender data was heavily missing in some species categories.


📊 Impact on Data Quality¶

  • The dataset is now fully structured, allowing for more reliable insights and predictions.
  • Data consistency and completeness have significantly improved.
  • Potential risk: Some borderline cases (e.g., individuals with weight around 10 kg) may require manual verification.

📌 Recommendation: A random sample verification step can further confirm accuracy.
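The verification step suggested above can be as simple as sampling imputed rows near the 10 kg decision boundary. A sketch on a toy frame; `was_missing` is a hypothetical mask that, in the notebook, would be captured before `df.apply(predict_gender, axis=1)`:

```python
import pandas as pd

# Toy stand-in: gender was imputed where `was_missing` is True
df_demo = pd.DataFrame({
    "weight": [9.6, 10.4, 4.75, 13.95, 9.9],
    "color": ["dark", "dark", "bright", "dark", "dark"],
    "gender": ["female", "male", "female", "male", "female"],
    "was_missing": [True, True, False, True, True],
})

# Borderline cases: imputed rows whose weight sits near the 10 kg threshold
borderline = df_demo[df_demo["was_missing"] & df_demo["weight"].between(9.0, 11.0)]
print(borderline[["weight", "color", "gender"]])

# Draw a reproducible sample for manual review
review = borderline.sample(n=min(2, len(borderline)), random_state=42)
```

Rows surfaced this way (weights between 9 and 11 kg) are exactly the ones where the rule is least trustworthy and a manual label check pays off most.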


📌 Next Steps & Recommendations¶

🔹 Perform a quality check on a small subset of the updated values to confirm accuracy.
🔹 Validate gender distribution trends to ensure logical consistency.
🔹 If required, develop a machine learning model to refine gender predictions for future data.


✅ Final Verdict¶

✔ The dataset is now complete, with no missing gender values.
✔ The applied method follows logical rules based on observed trends.
✔ Further validation is recommended for ensuring long-term data quality.


3. Correlation Matrix¶

In [21]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

print("Dataset Overview:")
print(df.info())

print("\nFirst 5 Rows:")
print(df.head())

# Print original column names
print("Original Columns:", df.columns.tolist())

# Clean column names (remove spaces and convert to lowercase)
df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
print('height -->')
print(df.dtypes)

# Print updated column names
print("Updated Columns:", df.columns.tolist())

# Identify constant columns (having only one unique value) and drop them
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
print("Constant Columns:", constant_cols)
df = df.drop(columns=constant_cols)

# Convert 'weight' column to float if necessary
df['weight'] = df['weight'].astype(float)

# Fill NaN values in 'weight' column with the median
df['weight'] = df['weight'].fillna(df['weight'].median())

# **DETECT AND ENCODE CATEGORICAL COLUMNS**
category_col = next((col for col in df.columns if 'category' in col.lower()), None)

if category_col and df[category_col].nunique() > 1:
    df['category_encoded'] = LabelEncoder().fit_transform(df[category_col])
    print(f"'{category_col}' column found and encoded.")
else:
    print("Warning: No suitable 'category' column found or column has only one unique value.")

# Identify and encode general categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
if categorical_cols:
    print("Categorical Columns:", categorical_cols)
    for col in categorical_cols:
        df[col] = LabelEncoder().fit_transform(df[col].astype(str))
else:
    print("No categorical columns found in the dataset.")

# **IDENTIFY NUMERIC COLUMNS AND FILL MISSING VALUES**
numeric_cols = df.select_dtypes(include=['number']).columns.tolist()
print("Numeric Columns:", numeric_cols)

if numeric_cols:
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
else:
    print("Warning: No numeric columns found!")
Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 35 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   675 non-null    object 
 1   Global Key           675 non-null    object 
 2   Row Data             675 non-null    object 
 3   Dataset ID           675 non-null    object 
 4   Dataset Name         675 non-null    object 
 5   Created At           675 non-null    object 
 6   Updated At           675 non-null    object 
 7   Created By           675 non-null    object 
 8   Height               675 non-null    int64  
 9   Width                675 non-null    int64  
 10  Asset Type           675 non-null    object 
 11  MIME Type            675 non-null    object 
 12  EXIF Rotation        675 non-null    int64  
 13  Experiment ID        675 non-null    object 
 14  Experiment Name      675 non-null    object 
 15  Run Name             675 non-null    object 
 16  Run Data Row ID      675 non-null    object 
 17  Split                675 non-null    object 
 18  Label Kind           675 non-null    object 
 19  Version              675 non-null    object 
 20  Label ID             675 non-null    object 
 21  Feature ID           675 non-null    object 
 22  Feature Schema ID    675 non-null    object 
 23  Name                 675 non-null    object 
 24  Value                675 non-null    object 
 25  Annotation Kind      675 non-null    object 
 26  Bounding Box Top     675 non-null    int64  
 27  Bounding Box Left    675 non-null    int64  
 28  Bounding Box Height  675 non-null    int64  
 29  Bounding Box Width   675 non-null    int64  
 30  species              675 non-null    object 
 31  gender               675 non-null    object 
 32  color                675 non-null    object 
 33  weight               675 non-null    float64
 34  img_path             675 non-null    object 
dtypes: float64(1), int64(7), object(27)
memory usage: 184.7+ KB
None

First 5 Rows:
                          ID  \
0  clyxetrm60mcb0796rsdu4ob9   
1  clyxetrm60mcc0796uilaudlq   
2  clyxetrm60mcd0796albl43as   
3  clyxetrm60mce0796r9gf6geg   
4  clyxetrm60mcf0796zbzj1nb6   

                                          Global Key  \
0  upload-raw-images/circleseafoods-camera-03/202...   
1  upload-raw-images/circleseafoods-camera-03/202...   
2  upload-raw-images/circleseafoods-camera-03/202...   
3  upload-raw-images/circleseafoods-camera-03/202...   
4  upload-raw-images/circleseafoods-camera-03/202...   

                                            Row Data  \
0  gs://upload-raw-images/circleseafoods-camera-0...   
1  gs://upload-raw-images/circleseafoods-camera-0...   
2  gs://upload-raw-images/circleseafoods-camera-0...   
3  gs://upload-raw-images/circleseafoods-camera-0...   
4  gs://upload-raw-images/circleseafoods-camera-0...   

                  Dataset ID            Dataset Name  \
0  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   
1  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   
2  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   
3  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   
4  clyxesxqf00se0776dq71ylh0  Circleseafoods-18-July   

                      Created At                     Updated At  \
0  2024-07-22T19:58:52.688+00:00  2024-07-22T19:58:59.539+00:00   
1  2024-07-22T19:58:52.688+00:00  2024-07-22T19:59:07.704+00:00   
2  2024-07-22T19:58:52.688+00:00  2024-07-22T19:59:08.212+00:00   
3  2024-07-22T19:58:52.688+00:00  2024-07-22T19:59:07.983+00:00   
4  2024-07-22T19:58:52.688+00:00  2024-07-22T19:59:07.228+00:00   

         Created By  Height  Width  ...   Annotation Kind Bounding Box Top  \
0  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   
1  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   
2  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   
3  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   
4  deepak@this.fish     720   1280  ...  ImageBoundingBox                0   

   Bounding Box Left Bounding Box Height Bounding Box Width species  gender  \
0                492                 699                193    chum  female   
1                449                 720                235    chum    male   
2                448                 720                235    chum    male   
3                449                 720                237    chum    male   
4                449                 720                238    chum    male   

    color weight                                           img_path  
0  bright   4.75  /opt/weight_dataset_v1/2024_07_18_17_36_30_792...  
1    dark  13.95  /opt/weight_dataset_v1/2024_07_18_17_37_23_113...  
2    dark  13.95  /opt/weight_dataset_v1/2024_07_18_17_37_23_113...  
3    dark  13.95  /opt/weight_dataset_v1/2024_07_18_17_37_23_113...  
4    dark  13.95  /opt/weight_dataset_v1/2024_07_18_17_37_25_262...  

[5 rows x 35 columns]
Original Columns: ['ID', 'Global Key', 'Row Data', 'Dataset ID', 'Dataset Name', 'Created At', 'Updated At', 'Created By', 'Height', 'Width', 'Asset Type', 'MIME Type', 'EXIF Rotation', 'Experiment ID', 'Experiment Name', 'Run Name', 'Run Data Row ID', 'Split', 'Label Kind', 'Version', 'Label ID', 'Feature ID', 'Feature Schema ID', 'Name', 'Value', 'Annotation Kind', 'Bounding Box Top', 'Bounding Box Left', 'Bounding Box Height', 'Bounding Box Width', 'species', 'gender', 'color', 'weight', 'img_path']
height -->
id                      object
global_key              object
row_data                object
dataset_id              object
dataset_name            object
created_at              object
updated_at              object
created_by              object
height                   int64
width                    int64
asset_type              object
mime_type               object
exif_rotation            int64
experiment_id           object
experiment_name         object
run_name                object
run_data_row_id         object
split                   object
label_kind              object
version                 object
label_id                object
feature_id              object
feature_schema_id       object
name                    object
value                   object
annotation_kind         object
bounding_box_top         int64
bounding_box_left        int64
bounding_box_height      int64
bounding_box_width       int64
species                 object
gender                  object
color                   object
weight                 float64
img_path                object
dtype: object
Updated Columns: ['id', 'global_key', 'row_data', 'dataset_id', 'dataset_name', 'created_at', 'updated_at', 'created_by', 'height', 'width', 'asset_type', 'mime_type', 'exif_rotation', 'experiment_id', 'experiment_name', 'run_name', 'run_data_row_id', 'split', 'label_kind', 'version', 'label_id', 'feature_id', 'feature_schema_id', 'name', 'value', 'annotation_kind', 'bounding_box_top', 'bounding_box_left', 'bounding_box_height', 'bounding_box_width', 'species', 'gender', 'color', 'weight', 'img_path']
Constant Columns: ['dataset_id', 'dataset_name', 'created_by', 'height', 'width', 'asset_type', 'mime_type', 'exif_rotation', 'experiment_id', 'experiment_name', 'run_name', 'label_kind', 'version', 'feature_schema_id', 'name', 'value', 'annotation_kind']
Warning: No suitable 'category' column found or column has only one unique value.
Categorical Columns: ['id', 'global_key', 'row_data', 'created_at', 'updated_at', 'run_data_row_id', 'split', 'label_id', 'feature_id', 'species', 'gender', 'color', 'img_path']
Numeric Columns: ['id', 'global_key', 'row_data', 'created_at', 'updated_at', 'run_data_row_id', 'split', 'label_id', 'feature_id', 'bounding_box_top', 'bounding_box_left', 'bounding_box_height', 'bounding_box_width', 'species', 'gender', 'color', 'weight', 'img_path']
In [22]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split

# `df` is the cleaned metadata DataFrame produced by load_metadata() above

print("Key Features Preview:\n", df[['species', 'gender', 'color', 'weight', 
                                   'bounding_box_height', 'bounding_box_width']].head())

# 1. FEATURE SELECTION & CLEANING
# Keep only biological/visual features
relevant_features = [
    'species',          # Fish species
    'gender',           # Biological gender
    'color',            # Color pattern
    'bounding_box_height',  # Pixel height from image analysis
    'bounding_box_width',   # Pixel width from image analysis
    'weight'            # Target variable
]

df = df[relevant_features].copy()

# 2. CATEGORICAL FEATURE PROCESSING
# One-Hot Encoding for categorical variables
categorical_cols = ['species', 'gender', 'color']
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# 3. FEATURE ENGINEERING
# Create meaningful derived features
df['fish_area'] = df['bounding_box_height'] * df['bounding_box_width']  # Area approximation
df['aspect_ratio'] = df['bounding_box_width'] / df['bounding_box_height']  # Shape characteristic

# 4. CORRELATION ANALYSIS (Enhanced)
plt.figure(figsize=(12,8))
corr_matrix = df.corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))  # Upper triangle mask

sns.heatmap(corr_matrix.where(mask), 
            annot=True, 
            cmap="coolwarm", 
            fmt=".2f", 
            linewidths=0.5,
            vmin=-1, 
            vmax=1,
            cbar_kws={"shrink": 0.8})
plt.title("Feature Correlation Matrix", fontsize=14)
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.tight_layout()
plt.show()

# 5. DATA PREPARATION
# Handle missing values
print("\nMissing Values Check:")
print(df.isnull().sum())

# Final cleaning
df = df.dropna().reset_index(drop=True)

# Split features and target
X = df.drop('weight', axis=1)
y = df['weight']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42
)

print("\nFinal Dataset Shapes:")
print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")

# 6. VISUAL ANALYSIS
plt.figure(figsize=(10,6))
sns.scatterplot(
    x='fish_area', 
    y='weight',
    hue='species_1' if 'species_1' in df.columns else None,
    data=df,
    palette='viridis',
    alpha=0.7
)
plt.title("Fish Area vs Weight Relationship", fontsize=14)
plt.xlabel("Fish Area (pixels²)")
plt.ylabel("Weight (lbs)")
plt.grid(True, alpha=0.3)
plt.show()

# Print first 10 data points used in the scatter plot
print("\nFirst 10 data points for Fish Area vs Weight:")
print(df[['fish_area', 'weight']].head(10))
Key Features Preview:
    species  gender  color  weight  bounding_box_height  bounding_box_width
0        0       0      0    4.75                  699                 193
1        0       1      1   13.95                  720                 235
2        0       1      1   13.95                  720                 235
3        0       1      1   13.95                  720                 237
4        0       1      1   13.95                  720                 238
[Figure: feature correlation matrix heatmap]
Missing Values Check:
bounding_box_height    0
bounding_box_width     0
weight                 0
species_1              0
gender_1               0
color_1                0
color_2                0
fish_area              0
aspect_ratio           0
dtype: int64

Final Dataset Shapes:
Training set: (540, 8), Test set: (135, 8)
[Figure: fish area vs. weight scatter plot]
First 10 data points for Fish Area vs Weight:
   fish_area  weight
0     134907    4.75
1     169200   13.95
2     169200   13.95
3     170640   13.95
4     171360   13.95
5     170640   13.95
6     171360   13.95
7     149175   13.95
8     143808   13.95
9     147150   13.95

📊 Insights & Evaluation¶

🔹 Feature Selection & Cleaning¶

  • The dataset retains key biological and visual features relevant to fish weight prediction.
  • One-hot encoding was applied to categorical variables (species, gender, color), allowing them to be used in machine learning models.
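The encoding step can be illustrated on a toy frame; note that in this notebook the categories were already label-encoded as integers, which is why the dummy columns later appear as species_1, color_2, etc. A minimal sketch with hypothetical string categories:

```python
import pandas as pd

# Toy frame with hypothetical category values
demo = pd.DataFrame({
    "species": ["chum", "chum", "pink"],
    "gender": ["female", "male", "male"],
    "color": ["bright", "dark", "dark"],
})

# drop_first=True keeps k-1 dummies per category to avoid a redundant column
encoded = pd.get_dummies(demo, columns=["species", "gender", "color"], drop_first=True)
print(list(encoded.columns))  # ['species_pink', 'gender_male', 'color_dark']
```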

🔹 Correlation Analysis¶

  • Fish Area (bounding_box_height × bounding_box_width) shows a strong positive correlation with weight, suggesting it is a key predictor.
  • Aspect Ratio (bounding_box_width / bounding_box_height) has a weaker correlation, indicating its limited direct impact on weight prediction.
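The area and aspect-ratio observations can be reproduced with pandas' `corrwith`; a sketch on synthetic data (the real df is assumed to carry the bounding-box columns and weight):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
h = rng.uniform(400, 720, 200)   # synthetic bounding-box heights (px)
w = rng.uniform(150, 300, 200)   # synthetic bounding-box widths (px)
demo = pd.DataFrame({"bounding_box_height": h, "bounding_box_width": w})
demo["fish_area"] = h * w
demo["aspect_ratio"] = w / h
# weight grows with area plus noise, mimicking the observed relationship
demo["weight"] = demo["fish_area"] / 15000 + rng.normal(0, 0.5, 200)

corrs = demo[["fish_area", "aspect_ratio"]].corrwith(demo["weight"])
print(corrs)  # fish_area correlates strongly; aspect_ratio only weakly
```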

🔹 Data Preparation¶

  • Missing values were successfully handled, reducing data inconsistencies.
  • The dataset was split into 540 training samples and 135 test samples, ensuring a reasonable train-test ratio (80-20 split).
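The 540/135 figures follow from `test_size=0.2` applied to the 675 rows that survive cleaning; a quick arithmetic check on a placeholder matrix of the same shape:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.zeros((675, 8))  # placeholder with the cleaned feature-matrix shape
y = np.zeros(675)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)  # (540, 8) (135, 8)
```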

🔹 Visual Analysis: Fish Area vs. Weight¶

  • A positive relationship is observed between fish area and weight, confirming that larger fish generally weigh more.
  • The data shows some outliers, which might need further investigation.
  • The scatter plot could benefit from color differentiation by species to analyze species-specific weight variations.

🚀 Next Steps¶

  1. Feature Scaling: Standardize numerical features to improve model performance.
  2. Outlier Detection: Examine extreme values in the dataset for potential errors or irregularities.
  3. Model Selection: Compare regression models (Linear Regression, Decision Trees, Neural Networks) to identify the best predictive approach.
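Steps 1 and 2 can be sketched together; the 815 value below mimics the mis-entered weight mentioned during cleaning, and the normally distributed weights are otherwise synthetic:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
weights = pd.Series(np.append(rng.normal(10, 2, 100), [815.0]))

# 1. Feature scaling: zero mean, unit variance
scaled = StandardScaler().fit_transform(weights.to_frame())

# 2. Outlier detection with the 1.5 * IQR rule
q1, q3 = weights.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]
print(outliers.values)  # the 815.0 entry is flagged
```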

📈 Key Findings¶

  • Bounding Box Dimensions:

    • bounding_box_height and bounding_box_width are highly correlated (0.72), as expected.
    • Both features strongly correlate with fish_area (0.88 and 0.96), confirming that area is directly dependent on these values.
  • Weight Relationships:

    • weight has a strong correlation with fish_area (0.85), suggesting that larger fish tend to be heavier.
    • However, weight has a negative correlation with species_1 (-0.63), which implies that different species may have significantly different weight distributions.
  • Species & Color Influence:

    • species_1 has a weak negative correlation with color_1 (-0.36) and color_2 (-0.46), meaning species classification may not be strongly influenced by color.
    • gender_1 has a mild correlation with color_1 (0.36), indicating possible gender-based color differences.

📌 Conclusion¶

  • The high correlation between bounding_box_width, bounding_box_height, and fish_area suggests redundancy; one of these features may be removed or transformed.
  • species_1 and weight show a meaningful inverse relationship, which can be explored further for classification tasks.
  • The weak correlations between species and colors suggest that color alone is not a strong distinguishing factor between species.

4. Feature Selection¶

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

# Drop missing values
df.dropna(inplace=True)

# Encode categorical variables
label_encoders = {}
for col in df.select_dtypes(include=["object"]).columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Define features and target variable
X = df.drop(columns=["weight"])  # Specify your target column
y = df["weight"]

# Scale features
scaler = StandardScaler()  # tabular feature normalization (zero mean, unit variance)
X_scaled = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Train Random Forest for feature importance
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Compute feature importance
feature_importances = pd.DataFrame({
    "Feature": X.columns,
    "Importance": model.feature_importances_
})

# The engineered `fish_area` and `aspect_ratio` columns are already in X, so they appear in the ranking
feature_importances = feature_importances.sort_values(by="Importance", ascending=False)

# Print the ranking
print("Feature Importances:\n", feature_importances)
# Plot feature importance
plt.figure(figsize=(12, 6))
sns.barplot(x="Importance", y="Feature", hue="Feature", data=feature_importances, palette="coolwarm", legend=False)
plt.title("Feature Importance (Random Forest)")
plt.show()
Feature Importances:
                Feature  Importance
6            fish_area    0.506355
0  bounding_box_height    0.311614
3             gender_1    0.157656
7         aspect_ratio    0.013495
1   bounding_box_width    0.006237
5              color_2    0.002517
4              color_1    0.002121
2            species_1    0.000005
[Figure: Random Forest feature importance bar chart]

📊 Feature Importance Analysis (Random Forest)¶

🔹 Feature Ranking & Interpretation¶

| Rank | Feature | Importance | Interpretation |
|------|---------|------------|----------------|
| 1️⃣ | fish_area | 0.5064 | The most important feature (~50.6%). Indicates that the total area occupied by the fish is highly correlated with weight. |
| 2️⃣ | bounding_box_height | 0.3116 | Highly important (~31.2%). Suggests that fish height is a strong predictor of weight. |
| 3️⃣ | gender_1 | 0.1577 | Moderately important (~15.8%). Gender differences may impact weight distribution. |
| 4️⃣ | aspect_ratio | 0.0135 | Low importance (~1.3%). Shape alone is not a strong determinant of weight. |
| 5️⃣ | bounding_box_width | 0.0062 | Very low importance (~0.6%). Width is much less relevant than height. |
| 6️⃣ | color_2 | 0.0025 | Negligible importance (~0.25%). Fish color does not significantly impact weight prediction. |
| 7️⃣ | color_1 | 0.0021 | Almost irrelevant (~0.21%). Similar to color_2, color is not a key weight determinant. |
| 8️⃣ | species_1 | 0.000005 | Insignificant here, likely because its signal is absorbed by the correlated size features, despite the −0.63 correlation with weight noted earlier. |
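Impurity-based importances like those above can overweight correlated features (fish_area is built from the bounding-box dimensions), so permutation importance is a common cross-check. A minimal sketch on synthetic data, not the notebook's dataset; column names mirror the ranking:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
h = rng.uniform(400, 720, 300)
w = rng.uniform(150, 300, 300)
X = pd.DataFrame({
    "bounding_box_height": h,
    "bounding_box_width": w,
    "fish_area": h * w,
    "aspect_ratio": w / h,
})
y = X["fish_area"] / 15000 + rng.normal(0, 0.3, 300)  # weight driven by area

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Permutation importance: drop in test score when one column is shuffled
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=42)
ranking = pd.Series(perm.importances_mean, index=X.columns).sort_values(ascending=False)
print(ranking)
```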

Exploratory Data Analysis and Image Preprocessing¶

In [24]:
# import cv2
# import numpy as np
# import seaborn as sns
# import matplotlib.pyplot as plt
# from sklearn.model_selection import train_test_split

# # Data Visualization
# sns.histplot(df['weight'], bins=30, kde=True)
# plt.title("Weight Distribution")
# plt.show()

# print(df.columns)

# # Image Preprocessing Function
# def load_and_preprocess_image(img_path, img_size=(224, 224)):
#     img = cv2.imread(img_path)
#     if img is None:
#         print(f"Warning: Image at {img_path} not found.")
#         return np.zeros((*img_size, 3))  # Return a blank image if missing
#     img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
#     img = cv2.resize(img, img_size) / 255.0
#     return img

# # Restore 'img_path'
# df['img_path'] = df_original['img_path']

# # Process images efficiently
# image_paths = df['img_path'].values
# images = np.array([load_and_preprocess_image(img) for img in image_paths])

# # Splitting Data
# X = images  # Already in NumPy array format
# y = df['weight'].values
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Comparing XGBoost, Random Forest, and DenseNet for Regression¶

In [25]:
# import numpy as np
# import tensorflow as tf
# import matplotlib.pyplot as plt
# import seaborn as sns
# from tensorflow.keras.applications import DenseNet121
# from tensorflow.keras.models import Sequential
# from tensorflow.keras.layers import Dense, Dropout, GlobalAveragePooling2D
# from tensorflow.keras.preprocessing.image import ImageDataGenerator
# from sklearn.ensemble import RandomForestRegressor
# from xgboost import XGBRegressor
# from sklearn.model_selection import train_test_split

# # Reshape Data for XGBoost & Random Forest
# X_train_flat = X_train.reshape(X_train.shape[0], -1)  # Flatten images for XGBoost & RF
# X_test_flat = X_test.reshape(X_test.shape[0], -1)

# # ---------------- XGBoost Model ----------------
# xgb_model = XGBRegressor()
# xgb_model.fit(X_train_flat, y_train)

# # ---------------- Random Forest Model ----------------
# rf_model = RandomForestRegressor()
# rf_model.fit(X_train_flat, y_train)

# # ---------------- Deep Learning Model ----------------
# def create_dense_net():
#     base_model = DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
#     base_model.trainable = False  # Freeze pre-trained weights
#     model = Sequential([
#         base_model,
#         GlobalAveragePooling2D(),
#         Dense(128, activation='relu'),
#         Dropout(0.3),
#         Dense(1, activation='linear')
#     ])
#     model.compile(optimizer='adam', loss='mse', metrics=['mae'])
#     return model

# # Data Augmentation (Only for CNN)
# datagen = ImageDataGenerator(
#     rotation_range=20,
#     width_shift_range=0.2,
#     height_shift_range=0.2,
#     horizontal_flip=True
# )

# # Train CNN Model
# nn_model = create_dense_net()
# nn_model.fit(datagen.flow(X_train, y_train, batch_size=32), 
#              epochs=10, 
#              validation_data=(X_test, y_test))

# # ---------------- Model Evaluation ----------------
# xgb_pred = xgb_model.predict(X_test_flat)
# rf_pred = rf_model.predict(X_test_flat)
# nn_pred = nn_model.predict(X_test)

# # ---------------- Plot Predictions ----------------
# plt.figure(figsize=(10, 5))
# sns.scatterplot(x=y_test, y=xgb_pred, label='XGBoost', alpha=0.6)
# sns.scatterplot(x=y_test, y=rf_pred, label='Random Forest', alpha=0.6)
# sns.scatterplot(x=y_test, y=nn_pred[:, 0], label='DenseNet', alpha=0.6)

# plt.xlabel("Actual Weight")
# plt.ylabel("Predicted Weight")
# plt.title("Model Predictions vs. Actual Weight")
# plt.legend()
# plt.show()
In [27]:
# from sklearn.metrics import mean_squared_error, mean_absolute_error

# # Compute error metrics
# xgb_mse = mean_squared_error(y_test, xgb_pred)
# xgb_rmse = np.sqrt(xgb_mse)
# xgb_mae = mean_absolute_error(y_test, xgb_pred)

# rf_mse = mean_squared_error(y_test, rf_pred)
# rf_rmse = np.sqrt(rf_mse)
# rf_mae = mean_absolute_error(y_test, rf_pred)

# nn_mse = mean_squared_error(y_test, nn_pred)
# nn_rmse = np.sqrt(nn_mse)
# nn_mae = mean_absolute_error(y_test, nn_pred)

# # Print results
# print(f"XGBoost - MSE: {xgb_mse:.4f}, RMSE: {xgb_rmse:.4f}, MAE: {xgb_mae:.4f}")
# print(f"Random Forest - MSE: {rf_mse:.4f}, RMSE: {rf_rmse:.4f}, MAE: {rf_mae:.4f}")
# print(f"DenseNet - MSE: {nn_mse:.4f}, RMSE: {nn_rmse:.4f}, MAE: {nn_mae:.4f}")

# # Determine the best model based on RMSE
# models = {"XGBoost": xgb_rmse, "Random Forest": rf_rmse, "DenseNet": nn_rmse}
# best_model = min(models, key=models.get)
# print(f"\nThe best performing model is: {best_model}")

Exploratory Data Analysis and Image Preprocessing¶

In [ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import cv2
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.applications import ResNet50, EfficientNetB0, DenseNet121
from tensorflow.keras.layers import Dense, Dropout, Flatten, GlobalAveragePooling2D
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

# Data Visualization
sns.histplot(df['weight'], bins=30, kde=True)
plt.title("Weight Distribution")
plt.show()

print(df.columns)
# Function to Read and Preprocess Images --Image Data Normalization
def load_and_preprocess_image(img_path, img_size=(224, 224)):
    img = cv2.imread(img_path)
    if img is None:  # guard against missing or unreadable files
        print(f"Warning: Image at {img_path} not found.")
        return np.zeros((*img_size, 3))  # blank placeholder image
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, img_size) / 255.0  # Normalize to [0,1]
    return img

# Restore 'img_path'
df['img_path'] = df_original['img_path']

# Process images efficiently
image_paths = df['img_path'].values
images = np.array([load_and_preprocess_image(img) for img in image_paths])

# Splitting Data
X = images  # Already in NumPy array format
y = df['weight'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
[Figure: weight distribution histogram]
Index(['bounding_box_height', 'bounding_box_width', 'weight', 'species_1',
       'gender_1', 'color_1', 'color_2', 'fish_area', 'aspect_ratio'],
      dtype='object')

Comparing XGBoost, Random Forest, and DenseNet for Regression¶

In [30]:
# Reshape for ML models (Flatten Image Features)
X_train_flat = X_train.reshape(X_train.shape[0], -1)
X_test_flat = X_test.reshape(X_test.shape[0], -1)

# XGBoost Model 
xgb_model = XGBRegressor()
xgb_model.fit(X_train_flat, y_train)

# Random Forest Model
rf_model = RandomForestRegressor()
rf_model.fit(X_train_flat, y_train)

# SVR Model
svr_model = SVR()
svr_model.fit(X_train_flat, y_train)
Out[30]:
SVR()
In [31]:
def create_cnn_model(base_model):
    base_model.trainable = False
    model = Sequential([
        base_model,
        GlobalAveragePooling2D(),
        Dense(128, activation='relu'),
        Dropout(0.3),
        Dense(1, activation='linear')
    ])
    model.compile(optimizer=Adam(learning_rate=0.001), loss='mse', metrics=['mae'])
    return model

# DenseNet-Based Model
cnn_model = create_cnn_model(DenseNet121(weights='imagenet', include_top=False, input_shape=(224, 224, 3)))

# Data Augmentation
datagen = ImageDataGenerator(rotation_range=20, width_shift_range=0.2, height_shift_range=0.2, horizontal_flip=True)

# Training the Model
cnn_model.fit(datagen.flow(X_train, y_train, batch_size=32), epochs=10, validation_data=(X_test, y_test))
Epoch 1/10
 1/17 ━━━━━━━━━━━━━━━━━━━━ 7:01 26s/step - loss: 17.0839 - mae: 3.3259
17/17 ━━━━━━━━━━━━━━━━━━━━ 0s 1s/step - loss: 10.7192 - mae: 2.4925
17/17 ━━━━━━━━━━━━━━━━━━━━ 64s 2s/step - loss: 10.6107 - mae: 2.4767 - val_loss: 5.8343 - val_mae: 1.9866
Epoch 2/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 334ms/step - loss: 5.7615 - mae: 1.7735 - val_loss: 3.4834 - val_mae: 1.3618
Epoch 3/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 325ms/step - loss: 3.7223 - mae: 1.3728 - val_loss: 2.6208 - val_mae: 1.0280
Epoch 4/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 328ms/step - loss: 3.2872 - mae: 1.2143 - val_loss: 2.3364 - val_mae: 0.8743
Epoch 5/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 327ms/step - loss: 2.7348 - mae: 1.1827 - val_loss: 2.1418 - val_mae: 0.8723
Epoch 6/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 325ms/step - loss: 2.3446 - mae: 1.0587 - val_loss: 1.8281 - val_mae: 0.7991
Epoch 7/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 324ms/step - loss: 3.0406 - mae: 1.1755 - val_loss: 1.9083 - val_mae: 0.8441
Epoch 8/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 348ms/step - loss: 2.5729 - mae: 1.1094 - val_loss: 1.6396 - val_mae: 0.7463
Epoch 9/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 336ms/step - loss: 2.3070 - mae: 1.0819 - val_loss: 1.5650 - val_mae: 0.8390
Epoch 10/10
17/17 ━━━━━━━━━━━━━━━━━━━━ 6s 327ms/step - loss: 1.9772 - mae: 0.9897 - val_loss: 1.3946 - val_mae: 0.7119
Out[31]:
<keras.src.callbacks.history.History at 0x7c670a8ff880>
In [ ]:
# import tensorflow as tf
# from xgboost import XGBRegressor

# # 1. DEFINE THE MODEL USING FUNCTIONAL API
# def create_cnn_feature_extractor():
#     inputs = tf.keras.Input(shape=(224, 224, 3)) # Explicitly define input 
#     base_model = DenseNet121(weights='imagenet', include_top=False, input_tensor=inputs)
#     x = GlobalAveragePooling2D()(base_model.output)
#     x = Dense(64, activation='relu')(x)
#     x = Dropout(0.2)(x)
#     outputs = Dense(1, activation='linear')(x)
    
#     model = tf.keras.Model(inputs=inputs, outputs=outputs)
#     return model

# # 2. CREATE AND COMPILE THE MODEL
# cnn_model = create_cnn_feature_extractor()
# cnn_model.compile(optimizer='adam', loss='mse', metrics=['mae'])

# # 3. RUN THE MODEL ONCE TO DEFINE THE INPUT SHAPE
# dummy_input = np.random.rand(1, 224, 224, 3)  # a random sample input
# _ = cnn_model.predict(dummy_input)

# # 4. CREATE FEATURE EXTRACTOR
# feature_extractor = tf.keras.Model(
#     inputs=cnn_model.input,
#     outputs=cnn_model.layers[-4].output  # the GlobalAveragePooling2D layer
# )

# # 5. EXTRACT FEATURES
# X_train_features = feature_extractor.predict(X_train, batch_size=8)
# X_test_features = feature_extractor.predict(X_test, batch_size=8)

# # 6. TRAIN XGBOOST REGRESSOR
# xgb_hybrid = XGBRegressor(
#     n_estimators=200,
#     learning_rate=0.05,
#     max_depth=5,
#     subsample=0.8
# )
# xgb_hybrid.fit(X_train_features, y_train)

Evaluating Model Performance Using MSE, MAE, and R²¶

In [ ]:
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name} -> MSE: {mse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")
    return y_pred

# Evaluate Models
xgb_pred = evaluate_model(xgb_model, X_test_flat, y_test, "XGBoost")
rf_pred = evaluate_model(rf_model, X_test_flat, y_test, "Random Forest")
svr_pred = evaluate_model(svr_model, X_test_flat, y_test, "SVR")
cnn_pred = evaluate_model(cnn_model, X_test, y_test, "CNN (DenseNet)")
# hybrid_pred = evaluate_model(xgb_hybrid, X_test_features, y_test, "CNN + XGBoost")
XGBoost -> MSE: 0.0059, MAE: 0.0172, R²: 0.9990
Random Forest -> MSE: 0.0106, MAE: 0.0653, R²: 0.9982

Model Predictions vs. Ground Truth: Scatter Plot Analysis¶

In [ ]:
plt.figure(figsize=(10,5))
sns.scatterplot(x=y_test, y=xgb_pred, label='XGBoost')
sns.scatterplot(x=y_test, y=rf_pred, label='Random Forest')
sns.scatterplot(x=y_test, y=svr_pred, label='SVR')
sns.scatterplot(x=y_test, y=cnn_pred[:,0], label='CNN (DenseNet)')
# sns.scatterplot(x=y_test, y=hybrid_pred, label='CNN + XGBoost')
plt.xlabel("Actual Weight")
plt.ylabel("Predicted Weight")
plt.legend()
plt.show()
[Figure: predicted vs. actual weight scatter plot]

Model Performance Comparison: Predicted vs Actual Weights¶

In [ ]:
!pip install plotly
import numpy as np
import plotly.graph_objects as go
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Define the evaluation function
def evaluate_model(model, X_test, y_test, model_name):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name} -> MSE: {mse:.4f}, MAE: {mae:.4f}, R²: {r2:.4f}")
    return mse, mae, r2

# Evaluate Models and Collect Metrics
models = {
    "XGBoost": (xgb_model, X_test_flat),
    "Random Forest": (rf_model, X_test_flat),
    "SVR": (svr_model, X_test_flat),
    "CNN (DenseNet)": (cnn_model, X_test),
    # "CNN + XGBoost": (xgb_hybrid, X_test_features),
}

metrics = {name: evaluate_model(model, X, y_test, name) for name, (model, X) in models.items()}

# Convert results to numpy arrays for plotting
mse_values = np.array([m[0] for m in metrics.values()])
mae_values = np.array([m[1] for m in metrics.values()])
r2_values = np.array([m[2] for m in metrics.values()])
model_names = list(metrics.keys())

# Create interactive bar charts with Plotly
fig = go.Figure()

# MSE Plot
fig.add_trace(go.Bar(
    y=model_names, 
    x=mse_values, 
    name="MSE (Lower is Better)", 
    orientation='h', 
    marker=dict(color='skyblue')
))

# MAE Plot
fig.add_trace(go.Bar(
    y=model_names, 
    x=mae_values, 
    name="MAE (Lower is Better)", 
    orientation='h', 
    marker=dict(color='salmon')
))

# R² Score Plot
fig.add_trace(go.Bar(
    y=model_names, 
    x=r2_values, 
    name="R² Score (Higher is Better)", 
    orientation='h', 
    marker=dict(color='lightgreen')
))

# Update layout for better aesthetics and labeling
fig.update_layout(
    title="Model Performance Comparison",
    xaxis_title="Metric Value",
    yaxis_title="Models",
    barmode='group',  # Grouped bars for side-by-side comparison
    template="plotly_dark",  # Dark theme
    height=600
)

# Show the plot
fig.show()
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: plotly in ./.local/lib/python3.10/site-packages (6.0.0)
Requirement already satisfied: narwhals>=1.15.1 in ./.local/lib/python3.10/site-packages (from plotly) (1.29.0)
Requirement already satisfied: packaging in ./.local/lib/python3.10/site-packages (from plotly) (24.2)
XGBoost -> MSE: 0.0059, MAE: 0.0172, R²: 0.9990
Random Forest -> MSE: 0.0173, MAE: 0.0751, R²: 0.9970
SVR -> MSE: 2.1956, MAE: 0.7386, R²: 0.6221
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 79ms/step
CNN (DenseNet) -> MSE: 29.7716, MAE: 4.7549, R²: -4.1244
CNN + XGBoost -> MSE: 0.1081, MAE: 0.0658, R²: 0.9814

Model Performance Evaluation¶

We evaluate each model using three metrics: MSE (Mean Squared Error), MAE (Mean Absolute Error), and R² (Coefficient of Determination).

Evaluation Metrics:¶

  • MSE: Lower is better. Measures the average squared difference between predicted and actual values.
  • MAE: Lower is better. Measures the average absolute difference between predicted and actual values.
  • R²: Higher is better. Measures the fraction of variance in the actual values that the predictions explain (1 is perfect; 0 means no better than always predicting the mean; negative means worse than that baseline).
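The three metrics above can be computed directly from their definitions with NumPy (this is a sketch for intuition; the notebook itself uses the equivalent scikit-learn functions):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE and R² directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)          # mean squared error
    mae = np.mean(np.abs(err))       # mean absolute error
    ss_res = np.sum(err ** 2)        # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot         # coefficient of determination
    return mse, mae, r2

mse, mae, r2 = regression_metrics([3.0, 5.0, 7.0], [2.5, 5.5, 7.0])
print(f"MSE={mse:.4f}, MAE={mae:.4f}, R²={r2:.4f}")
# MSE=0.1667, MAE=0.3333, R²=0.9375
```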

Model Comparison¶

Model            MSE      MAE     R²
XGBoost          0.0059   0.0172  0.9990
Random Forest    0.0173   0.0751  0.9970
SVR              2.1956   0.7386  0.6221
CNN (DenseNet)   29.7716  4.7549  -4.1244
CNN + XGBoost    0.1081   0.0658  0.9814

Performance Analysis¶

  • Best Model: XGBoost

    • MSE: 0.0059 (Lowest, ideal)
    • MAE: 0.0172 (Lowest, ideal)
    • R²: 0.9990 (Highest, ideal)

    XGBoost performs the best with the lowest MSE, lowest MAE, and highest R² values, making it the top-performing model for this task.

  • Second Best Model: Random Forest

    • MSE: 0.0173 (Higher than XGBoost, less ideal)
    • MAE: 0.0751 (Higher than XGBoost, less ideal)
    • R²: 0.9970 (Lower than XGBoost, less ideal)

    Random Forest is a good model, but its performance lags behind XGBoost in terms of MSE, MAE, and R².

  • Worst Performing Models: SVR and CNN (DenseNet)

    • SVR:

      • MSE: 2.1956 (Very high, poor performance)
      • MAE: 0.7386 (Much higher than other models)
      • R²: 0.6221 (Quite low)

      SVR has a very high MSE and low R², indicating poor predictive performance.

    • CNN (DenseNet):

      • MSE: 29.7716 (Extremely high, poor performance)
      • MAE: 4.7549 (Very high, poor performance)
      • R²: -4.1244 (Negative, poor model)

      CNN (DenseNet) has a very high MSE, MAE, and a negative R², making it the least suitable model for this task.

  • Hybrid Model (CNN + XGBoost):

    • MSE: 0.1081 (Higher than XGBoost, but better than Random Forest)
    • MAE: 0.0658 (Better than Random Forest, worse than XGBoost)
    • R²: 0.9814 (Good, but lower than XGBoost)

    The Hybrid Model (CNN + XGBoost) performs well but does not outperform XGBoost on its own; in this case, adding CNN-extracted features brings no measurable gain over XGBoost alone.
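The negative R² reported for the CNN above (-4.1244) means its predictions are worse than a trivial baseline that always predicts the mean of the actual weights. A small sketch with toy numbers makes this concrete (the values here are illustrative, not the notebook's data):

```python
import numpy as np

y_true = np.array([2.0, 4.0, 6.0, 8.0])
mean_baseline = np.full_like(y_true, y_true.mean())  # always predict the mean
bad_pred = np.array([8.0, 6.0, 4.0, 2.0])            # systematically wrong predictions

def r2(y, p):
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(r2(y_true, mean_baseline))  # 0.0 — the mean baseline by definition
print(r2(y_true, bad_pred))       # -3.0 — far worse than the baseline
```

Any model with R² below zero, like the CNN here, is therefore strictly worse than not modeling at all.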

Conclusion¶

  • Best Performing Model: XGBoost

    • XGBoost demonstrates the best overall performance across all evaluation metrics. It has the lowest MSE, MAE, and the highest R², making it the most reliable model for this task.
  • Second Best: Random Forest

    • Random Forest performs reasonably well but falls short of XGBoost in terms of performance metrics.
  • Worst Performing Models: SVR and CNN (DenseNet)

    • Both SVR and CNN (DenseNet) show poor performance with very high MSE, high MAE, and low R² values, making them unsuitable for this particular problem.

Final Recommendation¶

  • XGBoost should be used as the primary model for this task based on its superior performance.